A variety of genome transformations can occur as a microbial population adapts to a large
environmental change. In particular, genomic surveys indicate that, following the transition to an
obligate, host-dependent symbiont, the density of transposons first rises, then subsequently declines
over evolutionary time. Here, we show that these observations can be accounted for by a class
of generic stochastic models for the evolution of genomes in the presence of continuous selection
and gene duplication. The models use a fitness function that allows for partial contributions from
multiple gene copies, is an increasing but bounded function of copy number, and is optimal for
one fully adapted gene copy. We use Monte Carlo simulation to show that the dynamics result in
an initial rise in gene copy number followed by a subsequent fall-off due to adaptation to the new
environmental parameters. These results are robust for reasonable gene duplication and mutation
parameters when adapting to a novel target sequence. Our model provides a generic explanation
for the dynamics of microbial transposon density following a large environmental change such as
host restriction.

PACS numbers: 87.10.-e, 87.10.Mn, 87.23.Kg 
I. INTRODUCTION

Biological evolution involves a variety of genome trans- 
formations, which include point mutation, homologous 
recombination, gene duplication, and horizontal gene 
transfer [lH3 , and which involve a variety of mobile ge- 
netic elements IIHS]. Understanding the influence that 
each of these mechanisms and elements exerts on the 
process of evolution is one of the current frontiers in bi- 
ology 0- Transposons are one such genetic element — 
capable of copying and pasting segments from one loca- 
tion to another within a genome, they provide vehicles 
for gene duplication [51 E].

Transposons are copied and inserted across genomes 
through a variety of mechanisms [H [9] . Non-conservative 
transposons multiply within a genome by replicating 
themselves elsewhere. Many code for the proteins that 
copy and insert themselves throughout the genome. 
Other transposable elements are more passive, relying on 
proteins from other transposons in order to proliferate. 

Insertion sequences, or IS elements, are a particular 
type of non-conservative transposon, which contain their 
own proteins for replication but typically do not contain 
any additional proteins [101 E]. However, when two IS 
elements are near each other along a genome — for ex- 
ample, flanking both sides of a gene — they may form a 
composite transposon that includes the host genes sand- 
wiched between the IS elements [H [10] . These compos- 
ite transposons have been associated with the evolution 
of virulence factors that affect infection and severity of
diseases [HI |T3], and have been implicated in large chromosomal
rearrangements [14].

Because the genes required to replicate and insert an IS element tend to be
conserved, it is possible to estimate the IS element density within a genome
through sequence analysis. In order to probe the evolutionary dynamics of
microbial genomes, Moran and Plague estimated the IS density in Bacteria
following host-restriction, the transition to becoming an obligate,
host-dependent organism such as a gastrointestinal symbiont [15]. They found
that, on average, initial host restriction was followed by a sharp increase in
IS density, but that at long times the IS density generally declined to near
zero. This overall pattern has been more directly validated in a few taxonomic
lineages by tracking the evolution of particular closely related genomes
[16, 17]. This pattern may be sufficiently widespread that it should have a
generic explanation, independent of the particular organisms or environment,
and this is what we seek to provide in the present article.

In order to explain the trend in IS density following host-restriction, we
focus on the role of IS elements as vectors for gene amplification through
their roles in composite transposons [51 jTU]. Several studies have indicated
the importance of gene amplification in the rapid evolution of Bacteria
[18, 19]. Duplicated genes provide a basis for the evolution of novel function
[20, 21], and have been implicated in the evolution of new organismal forms
[22] and lineages [23]. Gene duplication events have been invoked in medically
important traits and diseases in humans [24] (including various forms of
cancer [25, 26]), as well as in the expansion of gene families such as the
globins [27] and the DNA replication processivity complex subunits [28]. Even
whole genome duplications have been observed in many organisms, including
yeast [29], small flowering plants [30], and pufferfish [31, 32].

FIG. 1: (Color online). Schematic: Gene activity following a
change in environmental conditions. Each gene gives rise to
a gene product that has multiple activities that affect
organismal fitness differently. Some of these activities
are beneficial, while others are deleterious. In general, the
selected genes tend to maximize the benefit while minimizing
any harmful "side-effect". However, when an organism undergoes
a change in its environmental conditions, previously
deleterious activities may become beneficial and vice versa.

If gene duplications confer an adaptive advantage to 
their hosts, then one might expect a concomitant pro- 
liferation in the IS elements that provide the mecha- 
nism for gene duplication — especially when adapting to 
a new environment. Conversely, in relatively consistent 
environments gene duplications confer no advantage to a 
well-adapted organism and we might anticipate selection 
pressure for a decrease in IS density. Thus, our strategy 
for interpreting the genomic trends reported by Moran 
and Plague [15] is to better understand the evolutionary
dynamics of gene duplication, and thence to infer the 
corresponding dynamics of the IS elements. 

In order to probe the link between gene duplication and adaptation to a novel
environment, we model genome dynamics with gene duplication. Previous
evolutionary modeling has largely focused on the process of mutation and
recombination [33, 34], large scale genome duplications [35, 36], or static
features such as copy number distribution [37]. Here we quantify a continuous
selection mechanism for the evolution of novel genes put forth by Bergthorsson
et al. [21]. In our model, we consider a protein encoded by a gene that has
multiple activities, for example, enzymatic activity or non-specific binding.
As shown schematically in Fig. 1, the initially deleterious effect of a gene
product may subsequently become beneficial to the organism when exposed to
different environmental conditions, as documented, for example, in the growing
body of literature on non-specific interactions involving proteins [38].

There are then two mechanisms by which an organism may increase a particular
protein activity: efficiency and expression. In the first mechanism, a gene may
undergo mutations that result in a more effective protein, i.e., a gene with
some small activity that is favorable to the organism can undergo mutations
that subsequently enhance that activity. In the second mechanism, increasing
the level of gene expression (for example, by creating additional copies of a
gene within the genome) also increases the total activity of a given gene
product.

[Figure 2 schematic. Panel titles: "Free Living: Moderate IS Density on
Average" and "Host-Restricted: Boom-Bust IS Density Cycle". Arrow labels:
"Increase in IS density" and "Decreasing IS density".]

FIG. 2: (Color online). Schematic: Adaptation to a changing
environment. We consider organisms experiencing a large
change in environmental conditions. Due to the role of gene
duplications in the adaptation of novel functions, these changed
conditions result in an enhancement of the IS element density
within the genome. Free-living bacteria experience a fluctuating
environment, which results in the maintenance of IS
density. Host-restriction represents a large change in environmental
conditions, resulting in an initial boom in IS density,
such as that seen in the recent obligate organisms. However,
a host also represents a stable and consistent environment.
The lack of any need for rapid adaptation leads to near-zero
IS density by a process of slow decay.

The purpose of this work is to show that adaptation to a large environmental
change provides a sufficient explanation for both the short term increase and
the long term decrease in IS density following host restriction, as outlined
in Fig. 2. The rapid adaptation to a new environment results in a large number
of gene duplications, presumably involving IS elements. Likewise, the long term
consistency of the host environment leads to a decreased number of duplicates
and near-zero IS density. We build a quantitative model of the mechanism for
adaptation via gene duplication [21], and show that this model accounts for the
gross characteristics in IS density following host restriction [15].

This paper is organized as follows. In Section II we present our quantitative
model of Bergthorsson et al.'s proposed mechanism for the emergence of novel
genes under continuous selection [21]. Results of our simulations are presented
in Section III. Finally, in Section VI we describe the biological
interpretation of our work, showing how our results are consistent with the
trends observed by Moran and Plague [15].



II. MODEL OF CONTINUOUS SELECTION 
WITH GENE DUPLICATION 



We consider a population of replicating cells whose replication rates depend
on the genes within their genomes. We express the probability of the kth cell
replicating according to a fitness function F_k,

where n represents the total number of genes in the genome, μ denotes the
fitness penalty arising from each additional gene, and g_j denotes the positive
fitness contribution of the jth gene within the genome. The parameter η
represents an assumed additive Gaussian noise with standard deviation d.
Introducing additive noise allows us to vary the strength with which the
organismal fitness is coupled to the genome, allowing us to take into account
the effect of environmental fluctuations as well as mutation in the parts of
the genome that are unrepresented in this simple model. Note that the
decoupling noise we have introduced here differs from the demographic noise
intrinsic to a population [39]. In Section VI B we will discuss the functional
form of the noise in more detail. The value of g_j depends inversely on the
Hamming distance d between the gene sequence S_j and some target sequence T,
i.e.,

where N represents the number of letters in each sequence (both S_j and T),
d(S_j, T) represents the Hamming distance between S_j and T, and S_{j,i} and
T_i represent the ith letter of S_j and T, respectively.
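
As an illustration of the gene score described above, the following Python sketch assumes the simplest form consistent with the text: a score that decreases linearly with the Hamming distance and equals 1 for a perfect match. The function names and the exact linear form are assumptions made for illustration; they are not taken from the equations of this paper.

```python
def hamming_distance(gene, target):
    """Number of positions at which two equal-length sequences differ."""
    if len(gene) != len(target):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(gene, target))


def gene_score(gene, target):
    """Assumed linear score: 1 for a perfect match, 0 if no letter matches."""
    return 1.0 - hamming_distance(gene, target) / len(target)
```

Under this assumption, a gene that differs from a target of length N = 4 at one position scores 0.75, so each matching letter contributes equally to the partial benefit.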

These parameters are intended to caricature a biological process whereby a
novel beneficial functionality captured by target sequence T would provide an
overall improvement in fitness given by g_j(S_j = T) = 1. Partially matching
sequences also provide some of the catalytic activity necessary for the new
function, yielding a partial benefit in proportion to the homology between the
gene sequence S_j and the target sequence T, in accordance with the continuous
selection model of Bergthorsson et al. [21]. The total benefit of a set of
genes must also be less than the maximum benefit arising from a single gene.
Therefore, each gene has a fixed cost μ that represents the deleterious effect
of non-specific interactions of the product coded for by a gene. For
simplicity, we assume that each gene in the genome is expressed equally,
without regard to the more detailed considerations of gene regulation. This
simplification has its basis in recent work indicating that gene copy number
is positively correlated with gene expression level [40].

The scheme outlined above clearly sets the optimum at the one-gene solution,
in which the gene matches the target sequence. This assumption comes from the
biological need to minimize deleterious non-specific interactions while
reaching a certain level of functional activity for a given biomolecule. Note
that while the optimum is chosen by design, the behavior of the system as it
evolves in time is not. We do not know a priori whether the optimum behavior
will be to duplicate or not to duplicate. After all, the long term advantage is
for the single gene case, and it might be superfluous to duplicate genes only
to then try to reduce their number. Furthermore, if we introduce high gene
duplication rates, will competition suffice to overcome the duplication rate
and drive the reduction in the number of genes? Or will the behavior of the
system depend on the careful tuning of these parameters? Constructing a simple
model enables us to answer these questions and to probe the viability of the
scheme we outline in Fig. 2.

Genes are allowed to evolve by spontaneous point mutation, internal
duplications, or deletions. A spontaneous mutation selects a position within a
gene at random from a uniform distribution and replaces the letter at that
position with one chosen at random from an alphabet of size c. Gene
duplications and deletions, just as with spontaneous mutations, occur on
randomly chosen genes within the population. Duplications are modeled as
insertion events, and do not overwrite existing genes. Fig. 3 outlines the
processes that lead to gene duplication and deletion.
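
The three genome operations described above can be sketched as follows. This is a minimal illustration, not the simulation code used for the results: a genome is represented as a list of genes, each gene as a list of integer letters, and the function names are hypothetical. The rule that a genome always keeps at least one gene is an assumption made here, since the text does not state how an empty genome would be handled.

```python
import random


def point_mutate(genome, alphabet_size, rng):
    """Overwrite one uniformly chosen letter of one uniformly chosen gene."""
    gene = rng.choice(genome)
    position = rng.randrange(len(gene))
    # The new letter is drawn from the whole alphabet, so it may equal the old one.
    gene[position] = rng.randrange(alphabet_size)


def duplicate_gene(genome, rng):
    """Insert a copy of a random gene; existing genes are not overwritten."""
    genome.append(list(rng.choice(genome)))


def delete_gene(genome, rng):
    """Remove a random gene, keeping at least one copy (an assumption)."""
    if len(genome) > 1:
        genome.pop(rng.randrange(len(genome)))
```

Passing an explicit `random.Random` instance as `rng` keeps replicate runs reproducible from their seeds.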

Initially, all organisms contain a single identical copy of a randomly chosen
gene. At each time t, a random cell k is selected from a population of fixed
size and its fitness F_k from Eq. (1) is measured. This number then defines the
probability that this cell will replicate at time t. If the organism
replicates, it overwrites a different randomly chosen organism in the
population, the so-called roulette scheme [41]. Thus, organisms are on average
being selected for higher replication rates (defined by F). One generation is
defined as the time it takes for half the total population to undergo a growth
attempt. A random update scheme governs which organisms will attempt to
replicate.
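
The update rule above can be sketched in Python as follows. This is a hedged illustration under stated assumptions: the fitness function is passed in as a callable (named `fitness` here) rather than written out, the function names are hypothetical, and clamping the fitness to the interval [0, 1] before using it as a probability is an assumption, since additive noise could otherwise push it outside that range. The population must contain at least two cells.

```python
import random


def replication_attempt(population, fitness, rng):
    """Pick a random cell and replicate it with probability fitness(cell)."""
    k = rng.randrange(len(population))
    p = min(1.0, max(0.0, fitness(population[k])))  # clamp: an assumption
    if rng.random() < p:
        # Choose a different cell uniformly at random and overwrite it.
        j = rng.randrange(len(population) - 1)
        if j >= k:
            j += 1
        population[j] = [list(gene) for gene in population[k]]
        return True
    return False


def run_generation(population, fitness, rng):
    """One generation: growth attempts by half the population size."""
    attempts = len(population) // 2
    return sum(replication_attempt(population, fitness, rng)
               for _ in range(attempts))
```

The population size stays fixed because every successful replication overwrites an existing cell, and `run_generation` returns the number of successful replications.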



III. SIMULATION RESULTS 



We characterize the model of continuous selection with gene duplication
described in Section II and contrast it with a model of continuous selection
with no gene duplication. In order to do so, we assign a randomly chosen target
sequence T and a randomly chosen gene sequence that is initially fixed in the
population (i.e., zero diversity at t = 0). We then allow the system to evolve
under a selection pressure described by Eq. (1).

FIG. 3: (Color online). Schematic: gene deletion and duplication.
(Upper) Deletion can occur via homologous recombination or
mismatch repair between nearly identical IS elements, which
serve as templates for the recombination. (Lower) Gene
duplication occurs when a composite transposon that is made up
of two flanking IS elements replicates and inserts itself in a
different part of the genome.

As shown in Fig. 4, the average number of genes in a genome ⟨n⟩ increases
sharply at the onset through gene duplication. As time passes and the
individual genes become better adapted toward the target sequence, the number
of genes then begins to decrease. In the long time limit, the gene number
appears to approach a small number greater than 1.

These changes in gene number are accompanied by changes in the individual gene
scores, or gene fitness g. Fig. 5 plots the average organismal fitness ⟨F⟩ as a
function of time and confirms that gene duplication enhances the initial rate
of adaptation (fitness increase) by providing a means for the organism to
amplify the benefit of a gene.

In contrast to the case of organismal fitness, the average gene score or
fitness ⟨g⟩ does not necessarily increase with gene number n. In principle,
there is a tension between the greater mutation rate that larger copy number
engenders and the lesser effect on fitness from an individual gene. By plotting
the average gene score ⟨g⟩ in Fig. 6, we see that, on average, genes initially
adapt faster toward the target sequence T with gene duplication. It thus
appears that the primary effect comes from mutation of the gene copies provided
by gene duplication, which leads to additional diversity in comparison to the
case without gene duplication.

Gene duplication amplifies the effect of point mutation. While the effect of a
point mutation on a single gene may be relatively small when compared to the
noise η, when that point mutation is duplicated numerous times, so is its
effect on the fitness of the organism. This can mean the difference between a
mutation that is effectively washed out by noise and one that is strongly
selected. Fig. 7 shows that the relative speed-up from gene duplication becomes
more dramatic as the magnitude of the noise increases. Notice how the point at
which the single gene case crosses over the gene duplication case shifts
further and further to the right with increasing d.

FIG. 4: (Color online). Average gene number ⟨n⟩ with and
without gene duplication for simulations with the fitness
function given by Eq. (1). The average is given by the darker lines,
while the lightly shaded areas represent the area covered by
the average standard deviation. In the initial phases of
adaptation, gene duplication dominates as the primary mode for
enhancing average fitness. As time passes, the slower mode of
adaptation provided by sequence mutation refines the genes
and the average number of genes per genome decreases. In
the case with no gene duplication or deletion the gene number
remains constant. Simulations were carried out with a
duplication rate of 1 per generation, a gene deletion rate of 0.2
per gene per generation, a mutation rate of 0.01 per gene per
generation, N = 10, c = 10, μ = 0.05, and d = 0.2. We
considered a population of 10000 organisms and averaged across
100 replicate simulations with the same parameters but
different initial seeds. In the case without gene duplication, we
set the gene duplication and deletion rates to 0 without change
to the other parameters.

These results support the mechanism for the evolution of novel proteins
proposed by Bergthorsson et al. [21]. In particular, the primary steps of gene
duplication to enhance expression, followed by the slower mutation and
selection of genes with better catalytic properties than the original, and
finally reduction of gene copies, all appear to have been captured by this
simple model. Note that in the long time limits we tested, the average gene
number ⟨n⟩ remained above unity, probably reflecting the fact that there is a
low probability of simultaneous beneficial mutations that would allow for a
favorable gene deletion. Also, the standard deviations in Figs. 4, 5, and 6
hint at the important role of population variance in adaptation [42, 43]. In
particular, notice that the variance in gene number and in gene fitness in
Figs. 4 and 6, respectively, both play a role in the additional population
variance seen in Fig. 5. This results in an organismal fitness variance that is
greater in the case with gene duplication than without, despite a larger gene
variance in the single gene case.

FIG. 5: (Color online). Average organismal fitness ⟨F⟩ with
and without gene duplication. With gene duplication the
initial rate of adaptation is faster than in the single gene case.
However, at long times the single gene case results in genes
that are closer to the target sequence T. Parameters are the
same as given in Fig. 4.

FIG. 6: (Color online). Average gene fitness ⟨g⟩ with and
without gene duplication for simulations with the fitness
function given by Eq. (1). With gene duplication the initial rate of
gene adaptation is faster than in the single gene case. However,
at longer times the single gene case results in genes that
are closer to the target sequence T. Parameters are the same
as given in Fig. 4. Note that the horizontal axis scale differs
from that shown in Fig. 4.

FIG. 7: (Color online). Average gene fitness ⟨g⟩ with and
without gene duplication for noise parameter d = 0.0, 0.2,
0.4, and 0.6 (from top to bottom). With gene duplication
the initial rate of gene adaptation is faster than in the single
gene case. However, at longer times the single gene case
results in genes that are closer to the target sequence T. Other
parameters are the same as given in Fig. 4.



IV. PARAMETER CHOICE 

In order to interpret these simulation results in terms 
of their impact on biological evolution, it is important to 
consider how to map the parameters from simulation to 
those of real biology. A direct mapping where simulation 
models biological processes at biological rates would be 
one particular solution. However, the microbial world is 
one of very large numbers, making this approach unfea- 
sible. Thus, it is often best to turn to other approaches 
such as scaling analysis in order to estimate how a sys- 
tem will behave at a parameter setting that is far from 
computationally tractable. 

The parameters chosen for the simulations described above are far from
matching those of real biology. Realistically, a microbe contains around 1000
genes, and the rates of mutation (10~^) and gene duplication (1) given by
Ref. [H] describe per organism rates per generation. Our simulation is intended
to model a particular slice of the genome, representing how one particular gene
within the cell might be subject to particular conditions and subsequently to
gene family expansion and contraction within a population. Thus, the
biologically relevant regime would be a mutation rate of 10~^^ with duplication
rates of 10^'^. This is fairly far from our choice of 10^-2 and 1 for these two
parameters, respectively. The rates of these basic parameters influence the
timescales within the simulations. Thus we can expect that simulations in a
realistic parameter range would require at least 10 orders of magnitude longer
simulation times based on the mutation rates alone. Even then, these
simulations still could not be considered realistic in light of microbial
populations that easily number in the billions.
FIG. 8: (Color online). Rescaled plot of organismal fitness for
different mutation rates without gene duplication. Average
organismal fitness ⟨F⟩ across 100 simulations plotted against
a rescaled x-axis of generations times mutation rate. Other
parameters are the same as given in Fig. 4. This plot shows
that the rate of organismal adaptation scales in proportion to
the rate at which point mutations are produced.



FIG. 9: (Color online). Rescaled plot of organismal fitness
for different mutation rates with gene duplication. Average
organismal fitness ⟨F⟩ plotted against a rescaled x-axis of
generations times mutation rate. Other parameters are the same
as given in Fig. 4. This plot shows that the rate of organismal
adaptation scales in proportion to the rate at which point
mutations are produced.



Since simulations of such size are not reasonably feasible, we instead focus on
understanding how the behavior of the system scales. In particular, mutation
rate and system size differ dramatically from biologically relevant parameters,
so we focus our scaling analysis on these two parameters. Figs. 8 and 9 show
that the rate of adaptation scales in proportion to the mutation rate. This
indicates that our results above qualitatively represent those we would obtain
for realistic biological parameters, with a simple multiplicative shift in time
being the main difference.

Fig. 10 shows that our results scale approximately with the logarithm of the
system size. Again, this scaling indicates that the speed-up in adaptation
presented here holds qualitatively for much larger systems, making it a
realistic pathway for biological evolution.
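
The two rescalings used in this section amount to multiplying the time axis by a constant. The following Python sketch shows one way to compute the rescaled axes, following the axis descriptions in the captions of Figs. 8, 9, and 10; the function names are hypothetical and the numbers in the usage note are illustrative only, not simulation results.

```python
import math


def rescale_by_mutation_rate(generations, mutation_rate):
    """Time axis in units of generations times mutation rate."""
    return [t * mutation_rate for t in generations]


def rescale_by_log_size(generations, population_size):
    """Time axis in units of generations times ln(population size)."""
    return [t * math.log(population_size) for t in generations]
```

For example, if one run with mutation rate 0.01 reached a given fitness after 100 generations and another with mutation rate 0.001 reached it after 1000 generations, both points would land at the same rescaled time of 1, which is the kind of collapse expected when adaptation scales in proportion to the mutation rate.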



V. ALTERNATE FITNESS FUNCTIONS 

The above model and results represent approximations to a possible biological
mechanism for generation of novel gene functions. We have attempted to keep our
model close to that of Bergthorsson et al. [21], in order to understand the
plausibility of the gene duplication mechanism they propose. Any study of
plausibility, however, should also contain some measure of sensitivity
analysis. The sensitivity of the model to the specific choice of parameters is
already discussed above. This same sensitivity check should also be applied to
our modeling choices.

In order to probe this, we put forth the following fitness function [Eq. (3)],
FIG. 10: (Color online). Rescaled plot of organismal fitness
for different system sizes with and without gene duplication.
Average organismal fitness ⟨F⟩ plotted against a rescaled
x-axis of generations times the natural logarithm of the system
size. 'No Duplication' indicates a duplication rate of 0.0 and
'With Duplication' indicates a duplication rate of 1.0. The
mutation rate used in both cases was 0.001. Other parameters are
the same as given in Fig. 4. This plot shows that the rate of
organismal adaptation scales in proportion to the logarithm
of the system size.
TABLE I: Summary of the number of simulation runs with the fitness
function of Eq. (3) that reach the target sequence. To
compute this, the average gene fitness was taken at the end
of each run (10,000 generations), and a run was considered to have
evolved sufficiently toward the target sequence if ⟨g⟩ > 0.5.
The genomes are initialized identically, with each initial gene
being chosen at random. As the initial number of genes grows,
the probability of a single gene being of sufficient benefit to
outweigh its cost increases. Thus, the probability of success
with gene duplication increases with initial gene number
accordingly. However, in the case without duplication, since
these genomes can neither duplicate nor delete genes, the
extra initial genes only provide an extra fitness cost. In over 99%
of the cases, either ⟨g⟩ > 0.9 or ⟨g⟩ < 0.01.

Eq. (3) alters the linear dependence on individual gene fitnesses given by
Eq. (2) to a squared dependence. This particular fitness function was chosen
because, as we increase the power to which the individual fitness g is raised,
we weaken the effect of the continuous selection proposed by Bergthorsson et
al. [21]. In other words, since squaring smaller numbers reduces their value by
a greater fraction than for larger numbers closer to 1, we are modeling a
weaker initial catalytic "side-effect" than in Eq. (1). Indeed, sometimes the
fitness costs of the initial genes outweigh their benefit, leading to a number
of simulations that never improve in fitness. Table I shows the number of cases
out of 1,000 that reach the target sequence by the end of the run.
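
The statement that raising g to a power penalizes poorly matched genes more than well matched ones can be checked with a short calculation. The sketch below is illustrative only; the function name is hypothetical, and it looks at the gene score in isolation rather than at the full fitness function.

```python
def retained_fraction(g, power):
    """Fraction of a gene score g (0 < g <= 1) that survives raising it to `power`."""
    return g ** power / g


# Compare a poorly matched gene with a nearly adapted one under squaring.
weak = retained_fraction(0.1, 2)    # about 0.1: roughly 90% of the score is lost
strong = retained_fraction(0.9, 2)  # about 0.9: roughly 10% of the score is lost
```

Since the retained fraction equals g raised to the power minus one, a higher power suppresses the contribution of weakly matching genes even further, which is the sense in which the continuous selection becomes weaker.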

Figs. 11, 12, and 13 show the behavior of the model under the fitness function
given by Eq. (3). All three plots show properties qualitatively similar to
those given in Section III for the fitness function in Eq. (1). Specifically,
the rise and fall in gene number in the case with gene duplication in Fig. 11
and the faster initial rises in organismal and gene fitness in Figs. 12 and 13
closely resemble the results of the previous model.

FIG. 11: (Color online). Average gene number ⟨n⟩ with and
without gene duplication for simulations with the fitness function
given by Eq. (3). The average was taken over the 418 and 1000
simulation runs that adapted to the target sequence with and
without gene duplication, respectively. In the initial phases of
adaptation, gene duplication dominates as the primary mode
for enhancing average fitness. As time passes, the slower mode
of adaptation provided by sequence mutation refines the genes
and the average number of genes per genome decreases. In
the case with no gene duplication or deletion the gene number
remains constant. Parameters not mentioned above are the
same as given in Fig. 4.

Increasing the power to which g is raised in the fitness function to 3, as in
Eq. (4), further reduces the effect of continuous selection, but again results
in qualitatively similar plots, though with a larger fraction of cases with
gene duplication that do not adapt toward the target gene (data not shown).

Ultimately, this reduction of effect continues as we raise the power of g in
the fitness function until the benefit from an additional gene will only
outweigh the cost for genes that exactly match the target sequence T. Retaining
results that are qualitatively similar to those presented in Section III
requires some form of continuous selection that allows for fitness
contributions from multiple genes, supporting the hypothesis that these
features play a key role in the evolution of novel gene function [21]. Removing
either of these two attributes entirely nullifies the benefits of gene
duplication (data not shown). However, our results do not depend very strongly
on the strength of either of these features, and our findings outlined in
Section III do not appear sensitive to the specific form of the fitness
function chosen, as long as it falls within a class of bounded fitness
functions that allow multiple gene contributions and some form of continuous
selection.



VI. DISCUSSION 

Significantly, the results for the case with gene duplication outperformed the
single gene case. Continuous selection, especially in the absence of noise,
provides a means for rapid uphill adaptation. Given that selection pressure on
any one gene is weaker when there are multiple copies, adaptation of genes
could arguably have been slower with gene duplication. We did not observe this
to be the case, and found that gene duplication has an initial advantage over
the single gene case. Note that this holds for the specific case where there is
a large difference between the initial and target sequences, and not for the
case where only smaller adjustments are required to reach the target sequence.
Also, we did not observe a monotonic increase in gene number when the gene
penalty was nonzero. Beneficial genes were able to proliferate, eventually
resulting in a reduction in the average number of genes per genome. That gene
number rises and falls according to fitness, overriding the duplication and
deletion rates for a broad range of parameters, shows that the selective
advantage, as we represent it here, is enough to overcome these intrinsic
rates. If it were otherwise, we would not be able to posit gene amplification
as the driving dynamic behind changes in IS density.

FIG. 12: (Color online). Average organismal fitness ⟨F⟩ for the
fitness function of Eq. (3), with and without gene duplication.
The average was taken over the 418 and 1000 simulation runs that
adapted to the target sequence with and without gene duplication,
respectively. Adaptation with gene duplication is faster than in
the single gene case. However, at long times the single gene case
catches up to the gene duplication case. Parameters are the same as
given in Fig. 4.



A. IS Density Following Host-Restriction 

IS elements and gene duplications go hand in hand. IS
elements copy and paste segments of the genome from one
location to another and thus are vehicles of gene
duplication [8, 9]. Enhancing gene duplication rates allows
an organism to take better advantage of the mode of
adaptation described in this work.

Host-restriction is defined as the process by which a
previously free-living organism becomes an obligate
organism, i.e., the transition from a more independent
organism to an organism dependent on a host for survival.
Genome comparisons among Buchnera indicate that this
process involves an initial period of massive deletions,
large-scale rearrangement, and the proliferation of
repetitive elements, followed by extreme stability and a
slow loss of additional genes [16]. The pattern of
repetitive element proliferation is consistent with Fig. 4,
in which we see an initial spike in IS density (given by
gene number n) followed by a slow decrease.

FIG. 13: (Color online). Average gene fitness ⟨g⟩ with and
without gene duplication for simulations with the fitness
function given by Eq. [S]. Average was taken over the 418
and 1000 simulation runs that adapted to the target
sequence with and without gene duplication, respectively.
Note that the gene duplication scheme requires an initial
gene with greater fitness in order to adapt toward the
target sequence T. With gene duplication the initial rate
of gene adaptation is faster than in the single gene case.
Despite the initial advantage present in the gene
duplication case, at longer times ⟨g⟩ for the single gene
case reaches similar levels. Parameters are the same as
given in Fig. 4.

Wide surveys of genomes reveal that organisms with
recently formed obligate associations show an increased
level of IS density in comparison to free-living organisms
[15]. Conversely, ancient obligate organisms generally show
a much lower IS density in comparison to free-living
organisms [15]. In the context of our model, as IS elements
proliferate, they grow in number and overall density within
a genome. The pattern of boom and bust in IS density seen
in the literature [15-17] corresponds to a cycle of rapid
adaptation to a new environment.
The level of transposon density in ancient obligate
organisms is then predicted by the long-time asymptotics of
the simulations in Fig. 4. Note that we do not see a
reduction from 2 genes to 1 on the timescales of our
simulation, because this requires a combination of point
mutations to occur in order to increase fitness, which
makes the n = 2 to n = 1 transition very rare and slow. On
the other hand, this qualitatively matches the finding that
only ancient obligate organisms (i.e., after long times)
exhibit exceedingly low transposon densities. In our
simulations, we mimicked the rapid environmental change of
host-restriction by choosing random target and starting
sequences. In other words, organisms that are newly
introduced into the host environment must undergo sizable
adaptive changes in order to better compete and survive in
their new conditions. The reduced number of transposable
elements and gene duplications after long times comes from
the long-time consistency that the host environment
provides. Lastly, free-living organisms face varying
environmental challenges as they migrate and their
open-system environments change. From time to time,
challenges requiring rapid changes arise, and IS density
increases rapidly but decreases slowly. Our model posits
that, due to the occasional occurrence of these challenges,
the IS density of free-living organisms never quite falls
to the same level as that of the ancient obligate
organisms. Fig. 14 shows how changes in the environment
might result in an increase in the number of transposons
within a genome that depends on the frequency and magnitude
of these changes. The figure depicts the case of many rapid
and dramatic changes to the environment in order to better
highlight this dynamic and differentiate it from the case
with a constant environment. Free-living organisms may
generally undergo fewer of these changes, or changes of
smaller magnitude, than the initial change that occurs when
an organism first enters a new host. This may make sense
given that the host environment introduces many new
factors, including immune factors, that free-living
organisms have not dealt with previously.

FIG. 14: (Color online). Gene number (n) with gene
duplication under changing and constant environments. The
fitness function is given by Eq. (1). The example
trajectory for changing environments was plotted by
changing the target sequence T to a random sequence at
randomly selected intervals according to a 0.04 probability
of change per generation. The curve for the constant
environment consists of the averaged data from Fig. 4 and
is plotted here for comparison. Other parameters are the
same as given in Fig. 4. Notice that the number of genes in
the changing environment is generally larger than in the
constant environment.
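
The changing-environment protocol of Fig. 14, in which the target
sequence T is replaced by a random sequence with probability 0.04 per
generation, can be sketched as follows. This is an illustrative sketch
only: the four-letter alphabet, the sequence length, and the function
names are assumptions, not details of our simulation code.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet, for illustration only


def random_sequence(length, rng):
    """Draw a uniformly random sequence of the given length."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def target_schedule(generations, length, p_change=0.04, seed=0):
    """Return the target sequence in force at each generation.

    At every generation the target is replaced by a fresh random
    sequence with probability `p_change`; otherwise it is kept.
    """
    rng = random.Random(seed)
    target = random_sequence(length, rng)
    schedule = []
    for _ in range(generations):
        if rng.random() < p_change:
            target = random_sequence(length, rng)
        schedule.append(target)
    return schedule
```

Because a change occurs independently each generation, the waiting time
between changes is geometrically distributed with mean 1/0.04 = 25
generations. Setting `p_change=0.0` recovers the constant environment.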



B. The Role of Noise 

The additive noise present in the fitness function given by
Eq. (1) plays an important role in determining the relative
advantage of gene amplification over single gene evolution.
In other words, noise determines how advantageous gene
duplication will be for a population. The greater the
noise, the more difficult the evolutionary problem of
finding the optimum becomes. At the same time, these more
difficult evolutionary problems are particularly suited for
gene duplication.

Consider for a moment the role and source of noise. 
Noise represents the coupling between the genome and 
the organismal fitness — the more noise, the weaker the 
coupling, and vice-versa. In principle, noise can arise 
from several sources. For instance, environmental fluctu- 
ations may destroy an organism and kill without regard 
for the organismal phenotype. Conversely, it is possible 
that every organism has an almost equal probability of 
reproducing at any given timepoint (even though on av- 
erage the fitter organism will still retain a fixed advantage 
given by the noiseless fitness function). Nonetheless, this 
change still dramatically alters the timescale on which 
selection acts. Noise is thus not a proxy of environmen- 
tal harshness, but instead of the sensitivity of selection, 
or selectivity. 
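
The sense in which noise sets the selectivity can be illustrated with a
short calculation. The sketch below assumes, purely for illustration,
that the additive noise on each organism is independent and Gaussian
with standard deviation d; the distribution and the function name are
assumptions. Under that assumption, an organism whose noiseless fitness
exceeds a competitor's by `delta` has the higher noisy fitness with
probability Phi(delta / (sqrt(2) d)), where Phi is the standard normal
cumulative distribution.

```python
import math


def prob_fitter_wins(delta, d):
    """Probability that the fitter of two organisms has the higher
    noisy fitness, assuming independent Gaussian noise of standard
    deviation d on each (an illustrative assumption).

    delta: noiseless fitness advantage of the fitter organism.
    """
    if d == 0:
        # No noise: the comparison is deterministic.
        if delta > 0:
            return 1.0
        return 0.5 if delta == 0 else 0.0
    # The difference of the two noise terms has variance 2 d^2, so
    # P = Phi(delta / (sqrt(2) d)) = 0.5 * (1 + erf(delta / (2 d))).
    return 0.5 * (1.0 + math.erf(delta / (2.0 * d)))
```

As d grows at fixed `delta`, this probability falls toward 1/2: the
fitter organism keeps its average advantage, but each individual
comparison becomes nearly a coin flip, so selection acts on a much
longer timescale.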

Selection favors phenotypes that grow, survive, and re- 
produce more prolifically than their neighbors. If small 
changes in genotype result in large changes to the sur- 
vival and reproduction of the organism, then the cou- 
pling between genotype and selection can be said to have 
a stronger effect than noise. Note that the overall rates 
of reproduction do not matter, but instead, competition 
is the dominant factor. Selection pressure is not a proxy 
for the harshness of the environment. An environment 
that kills indiscriminately is just as selective as an envi- 
ronment that allows indiscriminate growth within a finite 
capacity. 

We must now distinguish the selectivity (or conversely,
the noise) of the system from the average environmental
conditions or directionality of selection. When
the environment changes on longer time scales, the direc- 
tion of selection changes. We say that an environment 
is relatively stable or consistent when the average en- 
vironment remains essentially constant over time with 
small fluctuations. However, these environmental fluc- 
tuations are not the same as noise in our model, which 
represents small scale fluctuations that essentially ran- 
domize an organism's probability of survival or repro- 
duction. Thus, we regard the host-restriction that ne- 
cessitates rapid adaptation to the host environment as 
different from the selectivity of the environment. 

Indeed, in our model the noise or selectivity of the system
remains an important, but separate, contributor. An
alternate explanation for the boom in IS density following
host-restriction can be seen in Fig. 7. As the noise
parameter d increases, the advantage of gene duplications
increases. Thus, it is possible that simply by entering a
noisier environment one should see an increase in the
number of IS elements. However, this explanation cannot
account for the later decrease in IS density in the ancient
obligate organisms.



VII. CONCLUSION 
We have presented a model for the dynamics of a population
of microbial genomes following a change in environmental
conditions. Our results indicate the advantages of higher
IS density in accelerating the process of adaptation to
different environmental conditions, and the ensuing
decrease in IS density during subsequent restabilization of
the environment. This corroborates evidence from
observational bioinformatics [15] that indicates increased
IS density following host-restriction, a large
environmental change. Although our discussion has primarily
focused on host-restriction, genomic surveys can track
transposon number along environmental gradients, for
example as a function of depth in the ocean [33-45], and
our results should be relevant to the interpretation of
these data [46].